{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# Develop, Train, Register and Batch Transform Scikit-Learn Random Forest\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n",
"\n",
"![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/sagemaker-python-sdk|scikit_learn_model_registry_batch_transform|scikit_learn_model_registry_batch_transform.ipynb)\n",
"\n",
"---"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"\n",
"* Doc https://sagemaker.readthedocs.io/en/stable/using_sklearn.html\n",
"* SDK https://sagemaker.readthedocs.io/en/stable/sagemaker.sklearn.html\n",
"* boto3 https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#client\n",
"\n",
"In this notebook we show how to use Amazon SageMaker to train a Scikit-learn Random Forest model, register it in the SageMaker Model Registry, and run a Batch Transform Job. More info on Scikit-Learn can be found at https://scikit-learn.org/stable/index.html. We use the California Housing dataset, available in Scikit-Learn: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html. The California Housing dataset was originally published in:\n",
"\n",
"> Pace, R. Kelley, and Ronald Barry. \"Sparse spatial auto regressions.\" Statistics & Probability Letters 33.3 (1997): 291-297.\n",
"\n",
"Link to the paper: https://doi.org/10.1016/S0167-7152(96)00140-X"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"!pip install -U 'sagemaker<3.0'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"import sys\n",
"\n",
"!{sys.executable} -m pip install 'sagemaker<3.0' scikit-learn==1.2.1 --upgrade"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"import datetime\n",
"import time\n",
"import tarfile\n",
"\n",
"import boto3\n",
"import pandas as pd\n",
"import numpy as np\n",
"from sagemaker import get_execution_role\n",
"import sagemaker\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.datasets import fetch_california_housing\n",
"\n",
"\n",
"s3 = boto3.client(\"s3\")\n",
"sm_boto3 = boto3.client(\"sagemaker\")\n",
"\n",
"sess = sagemaker.Session()\n",
"\n",
"region = sess.boto_session.region_name\n",
"\n",
"bucket = sess.default_bucket() # this could also be a hard-coded bucket name\n",
"\n",
"print(\"Using bucket \" + bucket)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Prepare data\n",
"\n",
"We use the California housing dataset.\n",
"\n",
"More info on the dataset:\n",
"\n",
"This dataset was obtained from the `StatLib` repository. http://lib.stat.cmu.edu/datasets/\n",
"\n",
"The target variable is the median house value for California districts.\n",
"\n",
"This dataset was derived from the 1990 U.S. census, using one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"s3 = boto3.client(\"s3\")\n",
"s3.download_file(\n",
" f\"sagemaker-example-files-prod-{region}\",\n",
" \"datasets/tabular/california_housing/cal_housing.tgz\",\n",
" \"cal_housing.tgz\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"!tar -zxf cal_housing.tgz"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"columns = [\n",
" \"longitude\",\n",
" \"latitude\",\n",
" \"housingMedianAge\",\n",
" \"totalRooms\",\n",
" \"totalBedrooms\",\n",
" \"population\",\n",
" \"households\",\n",
" \"medianIncome\",\n",
" \"medianHouseValue\",\n",
"]\n",
"california_housing_df = pd.read_csv(\n",
" \"CaliforniaHousing/cal_housing.data\", names=columns, header=None\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"california_housing_df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"x_train, x_test = train_test_split(california_housing_df, test_size=0.25)\n",
"\n",
"x_eval = x_test[\n",
" [\n",
" \"longitude\",\n",
" \"latitude\",\n",
" \"housingMedianAge\",\n",
" \"totalRooms\",\n",
" \"totalBedrooms\",\n",
" \"population\",\n",
" \"households\",\n",
" \"medianIncome\",\n",
" ]\n",
"]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Let's inspect the training dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"x_train.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"x_train.shape"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Save the training, test, and evaluation data as CSV files."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"x_train.to_csv(\"california_housing_train.csv\")\n",
"x_test.to_csv(\"california_housing_test.csv\")\n",
"x_eval.to_csv(\"california_housing_eval.csv\", header=False, index=False)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Upload the training, test, and evaluation data to S3; the SageMaker Training Job and, later, the Batch Transform Job will read them from there."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"trainpath = sess.upload_data(\n",
" path=\"california_housing_train.csv\", bucket=bucket, key_prefix=\"sagemaker/sklearn-train\"\n",
")\n",
"\n",
"testpath = sess.upload_data(\n",
" path=\"california_housing_test.csv\", bucket=bucket, key_prefix=\"sagemaker/sklearn-train\"\n",
")\n",
"\n",
"print(trainpath)\n",
"print(testpath)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"sess.upload_data(\n",
" path=\"california_housing_eval.csv\", bucket=bucket, key_prefix=\"sagemaker/sklearn-eval\"\n",
")\n",
"\n",
"eval_s3_prefix = f\"s3://{bucket}/sagemaker/sklearn-eval/\"\n",
"eval_s3_prefix"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Writing a *Script Mode* script\n",
"The script below contains both training and inference functionality and can run either on SageMaker Training infrastructure or locally (desktop, SageMaker notebook instance, on premises, etc.). Detailed guidance: https://sagemaker.readthedocs.io/en/stable/using_sklearn.html#preparing-the-scikit-learn-training-script"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"%%writefile script.py\n",
"\n",
"import argparse\n",
"import joblib\n",
"import os\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"\n",
"\n",
"# inference functions ---------------\n",
"def model_fn(model_dir):\n",
" clf = joblib.load(os.path.join(model_dir, \"model.joblib\"))\n",
" return clf\n",
"\n",
"\n",
"if __name__ == \"__main__\":\n",
" print(\"extracting arguments\")\n",
" parser = argparse.ArgumentParser()\n",
"\n",
" # hyperparameters sent by the client are passed as command-line arguments to the script.\n",
" # to simplify the demo we don't use all sklearn RandomForest hyperparameters\n",
" parser.add_argument(\"--n-estimators\", type=int, default=10)\n",
" parser.add_argument(\"--min-samples-leaf\", type=int, default=3)\n",
"\n",
" # Data, model, and output directories\n",
" parser.add_argument(\"--model-dir\", type=str, default=os.environ.get(\"SM_MODEL_DIR\"))\n",
" parser.add_argument(\"--train\", type=str, default=os.environ.get(\"SM_CHANNEL_TRAIN\"))\n",
" parser.add_argument(\"--test\", type=str, default=os.environ.get(\"SM_CHANNEL_TEST\"))\n",
" parser.add_argument(\"--train-file\", type=str, default=\"california_housing_train.csv\")\n",
" parser.add_argument(\"--test-file\", type=str, default=\"california_housing_test.csv\")\n",
" parser.add_argument(\n",
" \"--features\", type=str\n",
" ) # in this script we ask user to explicitly name features\n",
" parser.add_argument(\n",
" \"--target\", type=str\n",
" ) # in this script we ask user to explicitly name the target\n",
"\n",
" args, _ = parser.parse_known_args()\n",
"\n",
" print(\"reading data\")\n",
" train_df = pd.read_csv(os.path.join(args.train, args.train_file))\n",
" test_df = pd.read_csv(os.path.join(args.test, args.test_file))\n",
"\n",
" print(\"building training and testing datasets\")\n",
" X_train = train_df[args.features.split()]\n",
" X_test = test_df[args.features.split()]\n",
" y_train = train_df[args.target]\n",
" y_test = test_df[args.target]\n",
"\n",
" # train\n",
" print(\"training model\")\n",
" model = RandomForestRegressor(\n",
" n_estimators=args.n_estimators, min_samples_leaf=args.min_samples_leaf, n_jobs=-1\n",
" )\n",
"\n",
" model.fit(X_train, y_train)\n",
"\n",
" # print abs error\n",
" print(\"validating model\")\n",
" abs_err = np.abs(model.predict(X_test) - y_test)\n",
"\n",
" # print couple perf metrics\n",
" for q in [10, 50, 90]:\n",
" print(\"AE-at-\" + str(q) + \"th-percentile: \" + str(np.percentile(a=abs_err, q=q)))\n",
"\n",
" # persist model\n",
" path = os.path.join(args.model_dir, \"model.joblib\")\n",
" joblib.dump(model, path)\n",
" print(\"model persisted at \" + path)\n",
" print(args.min_samples_leaf)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Launching a SageMaker training job with the Python SDK\n",
"\n",
"We will train two models: the first with 100 trees (`n-estimators`), and the second with 300. The exact values are not important; we simply want two distinct models, so that we can register each of them in the SageMaker Model Registry."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Launch the 1st training job\n",
"\n",
"Once we've defined our estimator, we can specify the hyperparameters to train with.\n",
"This first model is trained with 100 trees (`n-estimators`)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# We use the Estimator from the SageMaker Python SDK\n",
"from sagemaker.sklearn.estimator import SKLearn\n",
"\n",
"FRAMEWORK_VERSION = \"1.2-1\"\n",
"training_job_1_name = \"sklearn-california-housing-1\"\n",
"\n",
"sklearn_estimator_1 = SKLearn(\n",
" entry_point=\"script.py\",\n",
" role=get_execution_role(),\n",
" instance_count=1,\n",
" instance_type=\"ml.c5.xlarge\",\n",
" framework_version=FRAMEWORK_VERSION,\n",
" base_job_name=training_job_1_name,\n",
" metric_definitions=[{\"Name\": \"median-AE\", \"Regex\": \"AE-at-50th-percentile: ([0-9.]+).*$\"}],\n",
" hyperparameters={\n",
" \"n-estimators\": 100,\n",
" \"min-samples-leaf\": 3,\n",
" \"features\": \"longitude latitude housingMedianAge totalRooms totalBedrooms population households medianIncome\",\n",
" \"target\": \"medianHouseValue\",\n",
" },\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"sklearn_estimator_1.fit({\"train\": trainpath, \"test\": testpath})"
]
},
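{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The `metric_definitions` regex above is applied line by line to the training job's logs to extract the `median-AE` metric. As a quick local sanity check (using a hypothetical log line in the format printed by `script.py`), we can verify that the pattern captures the value:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"# Hypothetical log line, in the format printed by script.py during validation\n",
"log_line = \"AE-at-50th-percentile: 29353.28\"\n",
"match = re.search(r\"AE-at-50th-percentile: ([0-9.]+).*$\", log_line)\n",
"print(match.group(1))"
]
},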
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Create a Model Package Group for the trained model to be registered\n",
"\n",
"Create a new Model Package Group, or use an existing one, to register the model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"client = boto3.client(\"sagemaker\")\n",
"\n",
"model_package_group_name = \"sklearn-california-housing-\" + str(round(time.time()))\n",
"model_package_group_input_dict = {\n",
" \"ModelPackageGroupName\": model_package_group_name,\n",
" \"ModelPackageGroupDescription\": \"My sample sklearn model package group\",\n",
"}\n",
"\n",
"create_model_package_group_response = client.create_model_package_group(\n",
" **model_package_group_input_dict\n",
")\n",
"model_package_arn = create_model_package_group_response[\"ModelPackageGroupArn\"]\n",
"print(f\"ModelPackageGroup Arn : {model_package_arn}\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Register the model of the 1st training job in the Model Registry\n",
"Once the model is registered, you will see it in the Model Registry tab of the SageMaker Studio UI. By default, a model is registered with `approval_status` set to `PendingManualApproval`; users can then navigate to the Model Registry and approve the model manually, based on any criteria set for model evaluation, or do so via the API. Here we register the model with `approval_status` set to `Approved` directly."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"inference_instance_type = \"ml.m5.xlarge\"\n",
"model_package_1 = sklearn_estimator_1.register(\n",
" model_package_group_name=model_package_arn,\n",
" inference_instances=[inference_instance_type],\n",
" transform_instances=[inference_instance_type],\n",
" content_types=[\"text/csv\"],\n",
" response_types=[\"text/csv\"],\n",
" approval_status=\"Approved\",\n",
")\n",
"\n",
"model_package_arn_1 = model_package_1.model_package_arn\n",
"print(\"Model Package ARN : \", model_package_arn_1)"
]
},
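{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we registered the model with `approval_status=\"Approved\"`, no further action is needed here. If a model were registered with the default `PendingManualApproval` status, it could also be approved programmatically; a minimal sketch using the `model_package_arn_1` and the boto3 `client` from the cells above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Approve (or reject) a model package via the API instead of the Studio UI\n",
"client.update_model_package(\n",
" ModelPackageArn=model_package_arn_1,\n",
" ModelApprovalStatus=\"Approved\",\n",
" ApprovalDescription=\"Approved after reviewing evaluation metrics\",\n",
")"
]
},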
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Create a transform job with the default configurations from the model of the 1st training job"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"sklearn_1_transformer = model_package_1.transformer(\n",
" instance_count=1, instance_type=inference_instance_type\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"sklearn_1_transformer.transform(eval_s3_prefix, split_type=\"Line\", content_type=\"text/csv\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Let's inspect the output of the Batch Transform job in S3. It should show the predicted median house value for each block group."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"sklearn_1_transformer.output_path"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"output_file_name = \"california_housing_eval.csv.out\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"!aws s3 cp {sklearn_1_transformer.output_path}/{output_file_name} ."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"pd.read_csv(output_file_name, sep=\",\", header=None)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Launch the 2nd training job\n",
"\n",
"This time we will train the model with the `n-estimators` hyperparameter set to 300."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"training_job_2_name = \"sklearn-california-housing-2\"\n",
"\n",
"sklearn_estimator_2 = SKLearn(\n",
" entry_point=\"script.py\",\n",
" role=get_execution_role(),\n",
" instance_count=1,\n",
" instance_type=\"ml.c5.xlarge\",\n",
" framework_version=FRAMEWORK_VERSION,\n",
" base_job_name=training_job_2_name,\n",
" metric_definitions=[{\"Name\": \"median-AE\", \"Regex\": \"AE-at-50th-percentile: ([0-9.]+).*$\"}],\n",
" hyperparameters={\n",
" \"n-estimators\": 300,\n",
" \"min-samples-leaf\": 3,\n",
" \"features\": \"longitude latitude housingMedianAge totalRooms totalBedrooms population households medianIncome\",\n",
" \"target\": \"medianHouseValue\",\n",
" },\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"sklearn_estimator_2.fit({\"train\": trainpath, \"test\": testpath})"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Register the model of 2nd training job in the Model Registry"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"inference_instance_type = \"ml.c5.xlarge\"\n",
"model_package_2 = sklearn_estimator_2.register(\n",
" model_package_group_name=model_package_arn,\n",
" inference_instances=[inference_instance_type],\n",
" transform_instances=[inference_instance_type],\n",
" content_types=[\"text/csv\"],\n",
" response_types=[\"text/csv\"],\n",
" approval_status=\"Approved\",\n",
")\n",
"\n",
"model_package_arn_2 = model_package_2.model_package_arn\n",
"print(\"Model Package ARN : \", model_package_arn_2)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### View Model Groups and Versions\n",
"\n",
"You can view the details of a specific model version using either the AWS SDK for Python (Boto3) or Amazon SageMaker Studio.\n",
"To list the model versions in a model group with Boto3, call the `list_model_packages` method.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"list_model_packages_response = client.list_model_packages(ModelPackageGroupName=model_package_arn)\n",
"list_model_packages_response"
]
},
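{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sketch (not required for the rest of the notebook), you can iterate over the `ModelPackageSummaryList` in the response to print each version number, ARN, and approval status:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: print the version, ARN, and approval status of each registered model version\n",
"for package in list_model_packages_response[\"ModelPackageSummaryList\"]:\n",
"    print(\n",
"        package[\"ModelPackageVersion\"],\n",
"        package[\"ModelPackageArn\"],\n",
"        package[\"ModelApprovalStatus\"],\n",
"    )"
]
},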
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Let's fetch the latest model version from the Model Package Group."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"latest_model_version_arn = list_model_packages_response[\"ModelPackageSummaryList\"][0][\n",
" \"ModelPackageArn\"\n",
"]\n",
"print(latest_model_version_arn)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### View the latest Model Version details\n",
"\n",
"Call `describe_model_package` to see the details of a model version, passing in the ARN of a model version returned by `list_model_packages`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"client.describe_model_package(ModelPackageName=latest_model_version_arn)"
]
},
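{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"If you only need a few fields, you can read them directly from the describe response (a minimal sketch; `ModelPackageStatus` and `ModelApprovalStatus` are top-level keys in the response):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: pull selected fields out of the describe_model_package response\n",
"package_details = client.describe_model_package(ModelPackageName=latest_model_version_arn)\n",
"print(\"Status  :\", package_details[\"ModelPackageStatus\"])\n",
"print(\"Approval:\", package_details[\"ModelApprovalStatus\"])"
]
},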
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Create a Batch Transform job with the default configuration, using the model from the 2nd training job"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"sklearn_2_transformer = model_package_2.transformer(\n",
" instance_count=1, instance_type=inference_instance_type\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"sklearn_2_transformer.transform(eval_s3_prefix, split_type=\"Line\", content_type=\"text/csv\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Let's inspect the output locations of both Batch Transform jobs in S3. Each job writes its results to its own S3 location."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"sklearn_1_transformer.output_path"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"sklearn_2_transformer.output_path"
]
},
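{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sketch, you can download the 2nd job's output next to the 1st (this assumes `output_file_name` from above; the local file name `predictions_2.csv.out` is introduced here for illustration) and compare the two models' predictions side by side:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: download the 2nd transform job's output and compare predictions side by side\n",
"!aws s3 cp {sklearn_2_transformer.output_path}/{output_file_name} predictions_2.csv.out\n",
"\n",
"predictions_1 = pd.read_csv(output_file_name, header=None, names=[\"model_1\"])\n",
"predictions_2 = pd.read_csv(\"predictions_2.csv.out\", header=None, names=[\"model_2\"])\n",
"pd.concat([predictions_1, predictions_2], axis=1)"
]
},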
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Conclusion\n",
"\n",
"In this notebook you downloaded the California housing dataset and trained a model using the SageMaker Python SDK.\n",
"Then you created a `ModelPackageGroup`, registered the model version in SageMaker Model Registry, and triggered a SageMaker Batch Transform job to process the evaluation dataset from S3.\n",
"\n",
"You then trained another model, this time with 300 estimators, registered this model version in SageMaker Model Registry, viewed the model versions, and again triggered a SageMaker Batch Transform job to process the evaluation dataset from S3.\n",
"\n",
"As next steps, you can register your own model in SageMaker Model Registry and run a SageMaker Batch Transform job on data you have in S3."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Notebook CI Test Results\n",
"\n",
"This notebook was tested in multiple regions. The test results are as follows, except for us-west-2, which is shown at the top of the notebook.\n",
"\n",
"![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/sagemaker-python-sdk|scikit_learn_model_registry_batch_transform|scikit_learn_model_registry_batch_transform.ipynb)\n",
"\n",
"![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/sagemaker-python-sdk|scikit_learn_model_registry_batch_transform|scikit_learn_model_registry_batch_transform.ipynb)\n",
"\n",
"![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/sagemaker-python-sdk|scikit_learn_model_registry_batch_transform|scikit_learn_model_registry_batch_transform.ipynb)\n",
"\n",
"![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/sagemaker-python-sdk|scikit_learn_model_registry_batch_transform|scikit_learn_model_registry_batch_transform.ipynb)\n",
"\n",
"![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/sagemaker-python-sdk|scikit_learn_model_registry_batch_transform|scikit_learn_model_registry_batch_transform.ipynb)\n",
"\n",
"![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/sagemaker-python-sdk|scikit_learn_model_registry_batch_transform|scikit_learn_model_registry_batch_transform.ipynb)\n",
"\n",
"![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/sagemaker-python-sdk|scikit_learn_model_registry_batch_transform|scikit_learn_model_registry_batch_transform.ipynb)\n",
"\n",
"![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/sagemaker-python-sdk|scikit_learn_model_registry_batch_transform|scikit_learn_model_registry_batch_transform.ipynb)\n",
"\n",
"![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/sagemaker-python-sdk|scikit_learn_model_registry_batch_transform|scikit_learn_model_registry_batch_transform.ipynb)\n",
"\n",
"![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/sagemaker-python-sdk|scikit_learn_model_registry_batch_transform|scikit_learn_model_registry_batch_transform.ipynb)\n",
"\n",
"![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/sagemaker-python-sdk|scikit_learn_model_registry_batch_transform|scikit_learn_model_registry_batch_transform.ipynb)\n",
"\n",
"![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/sagemaker-python-sdk|scikit_learn_model_registry_batch_transform|scikit_learn_model_registry_batch_transform.ipynb)\n",
"\n",
"![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/sagemaker-python-sdk|scikit_learn_model_registry_batch_transform|scikit_learn_model_registry_batch_transform.ipynb)\n",
"\n",
"![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/sagemaker-python-sdk|scikit_learn_model_registry_batch_transform|scikit_learn_model_registry_batch_transform.ipynb)\n",
"\n",
"![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/sagemaker-python-sdk|scikit_learn_model_registry_batch_transform|scikit_learn_model_registry_batch_transform.ipynb)\n"
]
}
],
"metadata": {
"instance_type": "ml.t3.medium",
"kernelspec": {
"display_name": "Python 3 (Data Science 3.0)",
"language": "python",
"name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}